use OR(<<, >>) for all rotations #5

Open
oconnor663 wants to merge 1 commit into master
Conversation

oconnor663

Discussion/experiment PR: I was playing around with different rotations, and I found that these simplified ones seem to perform a lot better under Clang/Rustc. Most of the functions improve by about 6%, but BLAKE2b improves by 14%. Have you run into this before? Is it a known Clang vs GCC thing?
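For concreteness, here is a minimal sketch of the two forms being compared, using the 64-bit rotate-right-by-16 from BLAKE2b as the example. The function names are illustrative and not necessarily the ones in this PR's diff:

    #include <immintrin.h>

    /* OR(<<, >>) form: typically compiles to vpsrlq + vpsllq + vpor, or to
       vprorq when the compiler recognizes the rotation idiom on AVX-512. */
    static inline __m256i rotr64_16_shift(__m256i x) {
        return _mm256_or_si256(_mm256_srli_epi64(x, 16),
                               _mm256_slli_epi64(x, 64 - 16));
    }

    /* Byte-shuffle form: a single vpshufb whose constant mask implements a
       16-bit rotate of each 64-bit lane. */
    static inline __m256i rotr64_16_shuffle(__m256i x) {
        const __m256i r16 = _mm256_setr_epi8(
             2,  3,  4,  5,  6,  7,  0,  1, 10, 11, 12, 13, 14, 15,  8,  9,
             2,  3,  4,  5,  6,  7,  0,  1, 10, 11, 12, 13, 14, 15,  8,  9);
        return _mm256_shuffle_epi8(x, r16);
    }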

sneves (Owner) commented Dec 22, 2019

This is entirely dependent on architecture. I imagine you measured this on Skylake or Skylake-X. Also, cycle counts are more useful than percentages for understanding the difference.

Up to and including Haswell, [v]ps{r,l}l{d,q} could only be dispatched to a single execution port. Thus the sequence psrld, pslld, por necessarily took 3 cycles to run, whereas pshufb took 1 cycle (and you could run 2 of them per cycle from the 45nm Core 2 through Sandy Bridge).

With Skylake, [v]ps{r,l}l{d,q} can be dispatched to either of two execution ports, so the sequence psrld, pslld, por has 2 cycles of latency. On Skylake-X, clang will recognize the rotation idiom and replace it with vprord, whereas that will not happen with the shuffles.

In parallel modes, where latency does not matter much but overall throughput does, doing 4 independent rotations with pshufb will cost you exactly 4 cycles of throughput, and psrld, pslld, por will cost the same 4 cycles. The main difference is port pressure, which is concentrated on port 5 with the shuffles and more evenly spread out with the arithmetic.

oconnor663 (Author)

I was measuring this on the Skylake-SP server that AWS gives me, and on my Kaby Lake laptop.

What's the best way to treat these microarchitecture-specific differences? Should we generally optimize for the most modern thing? Is it worth shipping both versions and putting the choice behind a compiler flag?

sneves (Owner) commented Dec 23, 2019

Those are all questions without definitive answers. If you don't mind the maintenance burden, having a version for each major microarchitecture would be the best solution. But in general, the version that works best overall (for some definition of "overall") is preferable from a maintenance standpoint.
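If you did want to ship both, a rough sketch of the compiler-flag approach from the question above might look like the following, shown here for the 32-bit rotate by 16; the macro name BLAKE2_ROT_SHIFT_OR is made up for illustration:

    #include <immintrin.h>

    static inline __m256i rotr32_16(__m256i x) {
    #if defined(BLAKE2_ROT_SHIFT_OR)
        /* OR(<<, >>) form: lower latency on Skylake and later, where the
           vector shifts can go to either of two ports. */
        return _mm256_or_si256(_mm256_srli_epi32(x, 16),
                               _mm256_slli_epi32(x, 16));
    #else
        /* pshufb form: preferable up to Haswell, where the vector shifts
           share a single execution port. */
        const __m256i r16 = _mm256_setr_epi8(
             2,  3,  0,  1,  6,  7,  4,  5, 10, 11,  8,  9, 14, 15, 12, 13,
             2,  3,  0,  1,  6,  7,  4,  5, 10, 11,  8,  9, 14, 15, 12, 13);
        return _mm256_shuffle_epi8(x, r16);
    #endif
    }

A build would then pick one variant with, e.g., -DBLAKE2_ROT_SHIFT_OR on targets where it measures faster.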

sneves (Owner) commented Dec 24, 2019

So this performance win looks more like an LLVM "bug" than an actual optimization. In fact, I believe I had already seen this behavior somewhere before, and then forgot about it.

The real root cause is that, for rotations by 16, LLVM prefers to use the pair vpshuflw ymm0,ymm0,0x39; vpshufhw ymm0,ymm0,0x39 rather than the vpshufb ymm0,ymm0,ymm10 it was instructed to use. With -mavx2 but no explicit -march=... directive, clang defaults to an older instruction scheduling model where pshufb either does not exist or is slower.

And the funny thing is that with your patch, the pshufb for the rotation by 16 is actually generated (in fact, all of the same pshufbs are generated). By decomposing the pshufb into a generic vector shuffle and then reassembling it, LLVM ends up pessimizing the code. Go figure.

So the solution is simple: where Clang/LLVM is concerned, if you are using AVX2, also add -march=haswell, which should select a reasonable instruction scheduling model.
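Concretely, for a standalone clang build of one of the AVX2 files (the file name here is just a placeholder), that means something like:

    # default x86-64 scheduling model: the rotate by 16 may lower to vpshuflw/vpshufhw
    clang -O3 -mavx2 -c blake2b.c

    # Haswell scheduling model: keeps the single vpshufb
    clang -O3 -mavx2 -march=haswell -c blake2b.c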

sneves (Owner) commented Dec 24, 2019

The _mm256_shuffle_epi8 intrinsic appears to be decomposed into a general vector shuffle, which then gets pattern-matched as a 16-lane 16-bit element shuffle, and the general-purpose lowering is matched before the pshufb optimization.

And the reason it's "fixed" with -march=haswell is this:

  // Depth threshold above which we can efficiently use variable mask shuffles.
  int VariableShuffleDepth = Subtarget.hasFastVariableShuffle() ? 2 : 3;
  AllowVariableMask &= (Depth >= VariableShuffleDepth) || HasVariableMask;

So the pair vpshuflw, vpshufhw only gets reoptimized back into a single shuffle if the architecture in question is deemed to have fast variable shuffles, a feature first enabled for Haswell. The default subtarget, however, is based on Sandy Bridge and does not have it. It seems a bit odd to introduce this feature with Haswell; byte shuffles have been single-cycle since the 45nm Core 2.

Anyway, long story short, this is definitely a problem with LLVM.
